We’ve shown that clone-correction can reduce the power of \(\bar{r}_d\) to detect clonal reproduction, but it’s also clear that a reduced sample size contributes to this as well. To disentangle the two, we assess \(\bar{r}_d\) over 1000 replicates of n samples drawn from the full data set, where n is the size of the data set after clone-correction. The data were processed via the analyze_jackknife_ia.R script and saved in the jack_rda_files/ directory.
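The resampling scheme can be sketched as follows. This is an illustrative stand-in, not the actual analyze_jackknife_ia.R script: it uses poppr’s bundled monpop data, a made-up n, and far fewer replicates.

```r
# Illustrative sketch of the resampling scheme (NOT analyze_jackknife_ia.R):
# draw n genotypes without replacement from the full data set, where n stands
# in for the clone-corrected sample size, and recompute rbar_d each time to
# build the jackknife distribution.
library("poppr")
data(monpop)  # example data bundled with poppr; a stand-in for our simulations
n <- 20       # stand-in for the size of the clone-corrected data set
set.seed(9)
jack_rd <- replicate(25, {                # 1000 replicates in the real analysis
  sub <- monpop[sample(nInd(monpop), n)]  # subsample without replacement
  ia(sub, quiet = TRUE)[["rbarD"]]
})
summary(jack_rd)
```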
library('zksimanalysis')
res <- load(here::here("data", "full_results.rda")) # load original values of rd
jack_files <- list.files(here::here("data", "jack_rda_files/"), full.names = TRUE) # load jackknife simulations
jacklist <- setNames(jack_files, basename(jack_files))
for (i in jack_files){
jacklist[basename(i)] <- load(i)
}
jackdat <- jacklist %>%
map_df(get, .id = "source") %>%
mutate(mutation_rate = ifelse(grepl("mutant", source), "even", "uneven")) %>%
mutate(source = gsub("(^.+?feather).*$", "\\1.DATA.rda", source)) %>%
select(matches("jack"), everything())
rm(list = jacklist)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 5858742 312.9 9968622 532.4 6208511 331.6
## Vcells 1360016731 10376.2 1833074574 13985.3 1361919369 10390.7
# taking stock of how much data we have
jackdat %>% filter(!is.na(pop))
## # A tibble: 55,510 x 7
## jack.p.Ia jack.p.rD jack.ia jack.rd
## <dbl> <dbl> <list> <list>
## 1 0.425 0.244 <dbl [999]> <dbl [999]>
## 2 0.025 0.009 <dbl [999]> <dbl [999]>
## 3 0.022 0.006 <dbl [999]> <dbl [999]>
## 4 0.001 0.001 <dbl [999]> <dbl [999]>
## 5 0.129 0.129 <dbl [999]> <dbl [999]>
## 6 0.187 0.204 <dbl [999]> <dbl [999]>
## 7 0.027 0.018 <dbl [999]> <dbl [999]>
## 8 0.046 0.035 <dbl [999]> <dbl [999]>
## 9 0.120 0.116 <dbl [999]> <dbl [999]>
## 10 0.222 0.224 <dbl [999]> <dbl [999]>
## # ... with 55,500 more rows, and 3 more variables: source <chr>,
## # pop <chr>, mutation_rate <chr>
get(res)
## # A tibble: 80,000 x 54
## Ia p.Ia rbarD p.rD samples.ia samples.rd Iacc
## <dbl> <dbl> <dbl> <dbl> <list> <list> <dbl>
## 1 1.548209 0.001 0.1639916 0.001 <dbl [999]> <dbl [999]> 1.306757
## 2 2.313435 0.001 0.2487179 0.001 <dbl [999]> <dbl [999]> 2.372274
## 3 1.983252 0.001 0.2038003 0.001 <dbl [999]> <dbl [999]> 1.426773
## 4 1.697867 0.001 0.1683523 0.001 <dbl [999]> <dbl [999]> 1.341157
## 5 13.925948 0.001 0.8207048 0.001 <dbl [999]> <dbl [999]> 13.646827
## 6 14.187937 0.001 0.8293665 0.001 <dbl [999]> <dbl [999]> 14.152408
## 7 13.850928 0.001 0.7843280 0.001 <dbl [999]> <dbl [999]> 13.137558
## 8 13.986465 0.001 0.7940675 0.001 <dbl [999]> <dbl [999]> 13.966603
## 9 12.906462 0.001 0.6736799 0.001 <dbl [999]> <dbl [999]> 12.865542
## 10 13.099398 0.001 0.6849428 0.001 <dbl [999]> <dbl [999]> 12.912790
## # ... with 79,990 more rows, and 47 more variables: p.Iacc <dbl>,
## # rbarDcc <dbl>, p.rDcc <dbl>, samples.iacc <list>, samples.rdcc <list>,
## # pop <chr>, run <chr>, seed <chr>, sexrate <chr>, gen <chr>, rep <chr>,
## # sample <fctr>, mutation_rate <chr>, H <dbl>, G <dbl>, lambda <dbl>,
## # E.5 <dbl>, B <dbl>, uSimp <dbl>, NMLG <dbl>, H.var <dbl>, G.var <dbl>,
## # lambda.var <dbl>, E.5.var <dbl>, B.var <dbl>, uSimp.var <dbl>,
## # NMLG.var <dbl>, allele <dbl>, `1-D` <dbl>, Hexp <dbl>, Evenness <dbl>,
## # tab <list>, allelecc <dbl>, `1-Dcc` <dbl>, Hexpcc <dbl>,
## # Evennesscc <dbl>, tabcc <list>, H.locus <dbl>, G.locus <dbl>,
## # lambda.locus <dbl>, E.5.locus <dbl>, uSimp.locus <dbl>, rrmat <list>,
## # curve <list>, pgen <list>, CF <dbl>, source <chr>
Note that above, I’m filtering jackdat for non-missing populations, which deserves a bit of explanation. When jack.ia passed the data to poppr, the requirement for running was that n be greater than 2 and smaller than the total population size. If the number of MLGs did not meet this requirement, tidy_jackia would return all NA, including for the population.
We can’t ask any questions regarding bias with missing data, so we are filtering it out here.
# combining original values and jackknife simulations
vals <- get(res) %>%
inner_join(filter(jackdat, !is.na(pop)), # filtering out missing data
by = c("source", "mutation_rate", "pop")) %>%
select(rbarD, jack.rd, jack.p.rD, mutation_rate, everything())
vals
## # A tibble: 55,510 x 58
## rbarD jack.rd jack.p.rD mutation_rate Ia p.Ia p.rD
## <dbl> <list> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 0.1639916 <dbl [999]> 0.179 uneven 1.548209 0.001 0.001
## 2 0.2487179 <dbl [999]> 0.594 uneven 2.313435 0.001 0.001
## 3 0.2038003 <dbl [999]> 0.008 uneven 1.983252 0.001 0.001
## 4 0.1683523 <dbl [999]> 0.011 uneven 1.697867 0.001 0.001
## 5 0.8207048 <dbl [999]> 0.308 uneven 13.925948 0.001 0.001
## 6 0.8293665 <dbl [999]> 0.470 uneven 14.187937 0.001 0.001
## 7 0.7843280 <dbl [999]> 0.023 uneven 13.850928 0.001 0.001
## 8 0.7940675 <dbl [999]> 0.433 uneven 13.986465 0.001 0.001
## 9 0.6736799 <dbl [999]> 0.725 uneven 12.906462 0.001 0.001
## 10 0.6849428 <dbl [999]> 0.427 uneven 13.099398 0.001 0.001
## # ... with 55,500 more rows, and 51 more variables: samples.ia <list>,
## # samples.rd <list>, Iacc <dbl>, p.Iacc <dbl>, rbarDcc <dbl>,
## # p.rDcc <dbl>, samples.iacc <list>, samples.rdcc <list>, pop <chr>,
## # run <chr>, seed <chr>, sexrate <chr>, gen <chr>, rep <chr>,
## # sample <fctr>, H <dbl>, G <dbl>, lambda <dbl>, E.5 <dbl>, B <dbl>,
## # uSimp <dbl>, NMLG <dbl>, H.var <dbl>, G.var <dbl>, lambda.var <dbl>,
## # E.5.var <dbl>, B.var <dbl>, uSimp.var <dbl>, NMLG.var <dbl>,
## # allele <dbl>, `1-D` <dbl>, Hexp <dbl>, Evenness <dbl>, tab <list>,
## # allelecc <dbl>, `1-Dcc` <dbl>, Hexpcc <dbl>, Evennesscc <dbl>,
## # tabcc <list>, H.locus <dbl>, G.locus <dbl>, lambda.locus <dbl>,
## # E.5.locus <dbl>, uSimp.locus <dbl>, rrmat <list>, curve <list>,
## # pgen <list>, CF <dbl>, source <chr>, jack.p.Ia <dbl>, jack.ia <list>
First, let’s see what we have after the filtering:
vals %>%
group_by(mutation_rate, sexrate, sample) %>%
summarize(n = n()) %>%
spread(sample, n) %>%
knitr::kable()
| mutation_rate | sexrate | 10 | 25 | 50 | 100 |
|---|---|---|---|---|---|
| even | 0.0000 | 969 | 998 | 999 | 1000 |
| even | 0.0001 | 979 | 998 | 1000 | 1000 |
| even | 0.0005 | 989 | 1000 | 1000 | 1000 |
| even | 0.0010 | 963 | 1000 | 1000 | 1000 |
| even | 0.0050 | 583 | 996 | 1000 | 1000 |
| even | 0.0100 | 376 | 951 | 1000 | 1000 |
| even | 0.0500 | 107 | 492 | 944 | 1000 |
| even | 0.1000 | 42 | 281 | 727 | 996 |
| even | 0.5000 | 7 | 42 | 133 | 455 |
| even | 1.0000 | NA | NA | NA | 2 |
| uneven | 0.0000 | 934 | 999 | 999 | 1000 |
| uneven | 0.0001 | 931 | 999 | 1000 | 1000 |
| uneven | 0.0005 | 882 | 1000 | 1000 | 1000 |
| uneven | 0.0010 | 810 | 1000 | 1000 | 1000 |
| uneven | 0.0050 | 479 | 986 | 1000 | 1000 |
| uneven | 0.0100 | 347 | 927 | 1000 | 1000 |
| uneven | 0.0500 | 103 | 483 | 935 | 1000 |
| uneven | 0.1000 | 42 | 280 | 722 | 995 |
| uneven | 0.5000 | 7 | 42 | 128 | 451 |
Now, let’s visualize the “p-value” for each jackknife analysis. This p-value represents the fraction of jackknife simulations with a value less than or equal to the clone-corrected value.
In essence, we are asking the following question:
Can the clone-corrected value of \(\bar{r}_d\) be attributed to a small sample size?
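As a concrete (toy) illustration of that p-value, with made-up numbers:

```r
# The jackknife "p-value" in miniature, with made-up numbers: the fraction of
# jackknife replicates of rbar_d that are less than or equal to the
# clone-corrected observation. A small value means the clone-corrected
# estimate is lower than we'd expect from the reduced sample size alone.
set.seed(42)
jack_rd <- rnorm(999, mean = 0.05, sd = 0.02)  # hypothetical jackknife replicates
rd_cc   <- 0.01                                # hypothetical clone-corrected rbar_d
p_jack  <- mean(jack_rd <= rd_cc, na.rm = TRUE)
p_jack
```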
vals %>%
ggplot(aes(x = jack.p.rD, group = mutation_rate, fill = mutation_rate)) +
geom_histogram(alpha = 0.5, binwidth = 0.05) +
facet_grid(sexrate~sample, scales = "free")
Let’s take a look at the cases where \(\bar{r}_d\) was significant for the full data set but had a p-value > 0.05 after clone-correction:
vals %>%
filter(p.rD < p.rDcc, p.rDcc > 0.05) %>%
group_by(sexrate, sample) %>%
summarize(reject = sprintf("%.2f (%d)", sum(jack.p.rD < 0.05)/n(), n())) %>%
spread(sample, reject)
## # A tibble: 10 x 5
## # Groups: sexrate [10]
## sexrate `10` `25` `50` `100`
## * <chr> <chr> <chr> <chr> <chr>
## 1 0.0000 0.11 (331) 0.60 (304) 0.64 (272) 0.67 (279)
## 2 0.0001 0.12 (281) 0.56 (218) 0.59 (217) 0.61 (215)
## 3 0.0005 0.29 (168) 1.00 (22) 1.00 (8) 1.00 (2)
## 4 0.0010 0.24 (337) 1.00 (71) 1.00 (26) 1.00 (10)
## 5 0.0050 0.03 (748) 0.91 (911) 0.98 (568) 0.99 (329)
## 6 0.0100 0.01 (617) 0.75 (1362) 0.95 (1159) 0.98 (900)
## 7 0.0500 0.00 (201) 0.26 (893) 0.50 (1668) 0.64 (1734)
## 8 0.1000 0.00 (74) 0.12 (510) 0.32 (1280) 0.42 (1713)
## 9 0.5000 0.00 (13) 0.00 (78) 0.13 (233) 0.15 (675)
## 10 1.0000 <NA> <NA> <NA> 0.00 (2)
vals %>%
filter(p.rD < p.rDcc, p.rDcc > 0.05) %>%
ggplot(aes(x = jack.p.rD, group = mutation_rate, fill = mutation_rate)) +
geom_histogram(alpha = 0.5, binwidth = 0.05) +
facet_grid(sexrate~sample, scales = "free_y")
Yeah, those don’t look too good. The graph above is basically saying that we would consider most of the clonal populations to be sexually derived because the clone-corrected value is lower than the distribution. Of course, this assumes that the mean of the distribution is near the observed value of the whole data set, which may not be true.
To correct for this, I’m going to calculate the bias a la Section 10.3 in Efron and Tibshirani by subtracting the observed value from the bootstrap estimate:
\[ \hat\theta_B = \frac{1}{B} \sum_{b=1}^{B}\hat\theta^*_b \\ \widehat{\mathrm{Bias}}[\hat\theta] = \hat\theta_B - \hat\theta \]
Once I have the bias, I’m going to compare the bias-corrected value of the clone-corrected \(\bar{r}_d\) to the null distribution of the full data set. This would allow us to see if the reduced sample size affected the data (good example from http://www.stat.umn.edu/geyer/5601/examp/bias.html).
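A worked toy version of that arithmetic, using made-up numbers:

```r
# Worked toy example of the bias correction (made-up numbers): the bias is the
# mean of the resampled replicates minus the observed statistic from the full
# data set; subtracting it from the clone-corrected value gives the corrected
# estimate, which is then compared against the full-data null distribution.
theta_hat  <- 0.16                          # observed rbar_d, full data set
theta_star <- c(0.18, 0.20, 0.19, 0.21)     # hypothetical resampled replicates
bias       <- mean(theta_star) - theta_hat  # 0.195 - 0.16 = 0.035
rd_cc      <- 0.10                          # hypothetical clone-corrected rbar_d
rd_cc - bias                                # bias-corrected estimate: 0.065
```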
Note to future Zhian:
The answer may not be as simple as accounting for bias due to sample size, but also bias due to duplicated samples. Perhaps the way to go is to subtract the sum of the biases from the data.
# Upper-tail p-value: the fraction of null samples y at least as large as the
# observed value x, with a +1 pseudo-count so the p-value is never zero
makep <- function(x, y) (sum(x <= y, na.rm = TRUE) + 1)/(length(y) + 1)
bvals <- vals %>%
filter(p.rD < p.rDcc, p.rDcc > 0.05) %>%
mutate(jack.mean = map_dbl(jack.rd, mean, na.rm = TRUE)) %>%
mutate(bias = jack.mean - rbarD) %>% # calculating bias a la Efron
select(bias, jack.mean, everything()) %>%
mutate(jack.p.rDb = map2_dbl(rbarDcc - bias, samples.rd, makep)) %>%
select(bias, jack.p.rD, jack.p.rDb, everything())
Now we can visualize this as a table of the percentage rejection.
bvals %>%
group_by(sexrate, sample) %>%
summarize(reject = sprintf("%.2f (%.4f)",
sum(jack.p.rDb < 0.05)/n(),
median(bias, na.rm = TRUE))) %>%
spread(sample, reject) %>%
knitr::kable(digits = 3, caption = "Null Hypothesis Rejected (median bias)")
| sexrate | 10 | 25 | 50 | 100 |
|---|---|---|---|---|
| 0.0000 | 0.03 (0.0388) | 0.02 (0.0369) | 0.01 (0.0368) | 0.00 (0.0381) |
| 0.0001 | 0.07 (0.0366) | 0.02 (0.0233) | 0.02 (0.0296) | 0.00 (0.0316) |
| 0.0005 | 0.02 (0.0273) | 0.00 (0.0215) | 0.00 (0.0154) | 0.50 (0.0126) |
| 0.0010 | 0.05 (0.0093) | 0.08 (0.0091) | 0.27 (0.0067) | 0.20 (0.0072) |
| 0.0050 | 0.03 (0.0014) | 0.09 (0.0005) | 0.21 (0.0005) | 0.35 (0.0006) |
| 0.0100 | 0.02 (0.0011) | 0.04 (0.0002) | 0.09 (0.0002) | 0.16 (0.0002) |
| 0.0500 | 0.00 (0.0009) | 0.01 (0.0000) | 0.01 (0.0000) | 0.02 (0.0000) |
| 0.1000 | 0.03 (0.0009) | 0.00 (0.0000) | 0.01 (0.0000) | 0.01 (0.0000) |
| 0.5000 | 0.00 (0.0008) | 0.00 (-0.0000) | 0.00 (0.0000) | 0.00 (-0.0000) |
| 1.0000 | NA | NA | NA | 0.00 (0.0000) |
bvals %>%
ggplot(aes(x = p.rDcc, group = mutation_rate, fill = mutation_rate)) +
geom_density(alpha = 0.75) +
geom_rug(alpha = 0.5) +
facet_grid(sexrate~sample, scales = "free_y") +
scale_fill_viridis(option = "A", discrete = TRUE) +
ggtitle("Clone-censored p-values without bias correction") +
xlab("P-value")
bvals %>%
ggplot(aes(x = jack.p.rDb, group = mutation_rate, fill = mutation_rate)) +
geom_density(alpha = 0.75) +
geom_rug(alpha = 0.5) +
facet_grid(sexrate~sample, scales = "free_y") +
scale_fill_viridis(option = "D", discrete = TRUE) +
ggtitle("P-values after bias correction of clone-censored data") +
xlab("P-value (bias-corrected)")
bvals %>%
unite(the_group, seed, rep, run, gen) %>%
ggplot(aes(x = sexrate, y = bias, fill = mutation_rate)) +
geom_line(aes(group = the_group), alpha = 0.2) +
geom_boxplot(aes(group = factor(sexrate))) +
# geom_point(aes(group = the_group), alpha = 0.2) +
facet_grid(sample ~ mutation_rate)
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
Well, it turns out that things got worse, folks. Perhaps the answer is that we need to consider the fact that there is over-representation of the clones. If we parameterize the resampling procedure by weighting the samples by the value of \(p_{sex}\), then we can account for this over-representation. Consider that it is possible for a clonal genotype to be identical in state without being identical by descent; this is roughly what \(p_{sex}\) measures. When we weight the resamplings by \(p_{sex}\), we are more likely to grab a smaller subsample that is representative of the underlying genotypes. We can use that to calculate the bias due to the smaller sample size and use it to correct the clone-corrected estimate.
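The weighting idea can be sketched in one draw. This is illustrative only (the real procedure lives in jack.ia with use_psex = TRUE, and the psex values here are random stand-ins):

```r
# Sketch of psex-weighted resampling (illustrative only; the real procedure
# lives in jack.ia with use_psex = TRUE). Samples with a higher probability of
# being an independent sexual recruit get a higher chance of inclusion, so the
# subsample better reflects the underlying genets.
set.seed(3)
psex <- runif(50)                           # hypothetical psex per sample
idx  <- sample(50, size = 10, prob = psex)  # weighted draw without replacement
idx                                         # indices of the retained samples
```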
I went back and created new analyses, weighted by \(p_{sex}\). To avoid wasting memory, I’m going to remove the unnecessary data we have thus far:
rm(vals, jackdat)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 5885139 314.4 9968622 532.4 9968622 532.4
## Vcells 1286750751 9817.2 2199769488 16783.0 1440601473 10991.0
Now, I can load in the new results:
jack_files <- list.files(here::here("data", "jack_psex_rda_files/"), full.names = TRUE) # load jackknife simulations
jacklist <- setNames(jack_files, basename(jack_files))
for (i in jack_files){
jacklist[basename(i)] <- load(i)
}
jackdat <- jacklist %>%
map_df(get, .id = "source") %>%
mutate(mutation_rate = ifelse(grepl("mutant", source), "even", "uneven")) %>%
mutate(source = gsub("(^.+?feather).*$", "\\1.DATA.rda", source)) %>%
select(matches("jack"), everything())
rm(list = jacklist)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 6045854 322.9 9968622 532.4 9968622 532.4
## Vcells 1398271142 10668.0 2199769488 16783.0 1440601473 10991.0
# taking stock of how much data we have
jackdat %>% filter(!is.na(pop))
## # A tibble: 55,510 x 7
## jack.p.Ia jack.p.rD jack.ia jack.rd
## <dbl> <dbl> <list> <list>
## 1 1 1 <dbl [999]> <dbl [999]>
## 2 1 1 <dbl [999]> <dbl [999]>
## 3 1 1 <dbl [999]> <dbl [999]>
## 4 1 1 <dbl [999]> <dbl [999]>
## 5 1 1 <dbl [999]> <dbl [999]>
## 6 1 1 <dbl [999]> <dbl [999]>
## 7 1 1 <dbl [999]> <dbl [999]>
## 8 1 1 <dbl [999]> <dbl [999]>
## 9 1 1 <dbl [999]> <dbl [999]>
## 10 1 1 <dbl [999]> <dbl [999]>
## # ... with 55,500 more rows, and 3 more variables: source <chr>,
## # pop <chr>, mutation_rate <chr>
# combining original values and jackknife simulations
vals <- get(res) %>%
inner_join(filter(jackdat, !is.na(pop)), # filtering out missing data
by = c("source", "mutation_rate", "pop")) %>%
select(rbarD, jack.rd, jack.p.rD, mutation_rate, everything())
vals
## # A tibble: 55,510 x 58
## rbarD jack.rd jack.p.rD mutation_rate Ia p.Ia p.rD
## <dbl> <list> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 0.1639916 <dbl [999]> 1 uneven 1.548209 0.001 0.001
## 2 0.2487179 <dbl [999]> 1 uneven 2.313435 0.001 0.001
## 3 0.2038003 <dbl [999]> 1 uneven 1.983252 0.001 0.001
## 4 0.1683523 <dbl [999]> 1 uneven 1.697867 0.001 0.001
## 5 0.8207048 <dbl [999]> 1 uneven 13.925948 0.001 0.001
## 6 0.8293665 <dbl [999]> 1 uneven 14.187937 0.001 0.001
## 7 0.7843280 <dbl [999]> 1 uneven 13.850928 0.001 0.001
## 8 0.7940675 <dbl [999]> 1 uneven 13.986465 0.001 0.001
## 9 0.6736799 <dbl [999]> 1 uneven 12.906462 0.001 0.001
## 10 0.6849428 <dbl [999]> 1 uneven 13.099398 0.001 0.001
## # ... with 55,500 more rows, and 51 more variables: samples.ia <list>,
## # samples.rd <list>, Iacc <dbl>, p.Iacc <dbl>, rbarDcc <dbl>,
## # p.rDcc <dbl>, samples.iacc <list>, samples.rdcc <list>, pop <chr>,
## # run <chr>, seed <chr>, sexrate <chr>, gen <chr>, rep <chr>,
## # sample <fctr>, H <dbl>, G <dbl>, lambda <dbl>, E.5 <dbl>, B <dbl>,
## # uSimp <dbl>, NMLG <dbl>, H.var <dbl>, G.var <dbl>, lambda.var <dbl>,
## # E.5.var <dbl>, B.var <dbl>, uSimp.var <dbl>, NMLG.var <dbl>,
## # allele <dbl>, `1-D` <dbl>, Hexp <dbl>, Evenness <dbl>, tab <list>,
## # allelecc <dbl>, `1-Dcc` <dbl>, Hexpcc <dbl>, Evennesscc <dbl>,
## # tabcc <list>, H.locus <dbl>, G.locus <dbl>, lambda.locus <dbl>,
## # E.5.locus <dbl>, uSimp.locus <dbl>, rrmat <list>, curve <list>,
## # pgen <list>, CF <dbl>, source <chr>, jack.p.Ia <dbl>, jack.ia <list>
vals %>%
group_by(mutation_rate, sexrate, sample) %>%
summarize(n = n()) %>%
spread(sample, n) %>%
knitr::kable()
| mutation_rate | sexrate | 10 | 25 | 50 | 100 |
|---|---|---|---|---|---|
| even | 0.0000 | 969 | 998 | 999 | 1000 |
| even | 0.0001 | 979 | 998 | 1000 | 1000 |
| even | 0.0005 | 989 | 1000 | 1000 | 1000 |
| even | 0.0010 | 963 | 1000 | 1000 | 1000 |
| even | 0.0050 | 583 | 996 | 1000 | 1000 |
| even | 0.0100 | 376 | 951 | 1000 | 1000 |
| even | 0.0500 | 107 | 492 | 944 | 1000 |
| even | 0.1000 | 42 | 281 | 727 | 996 |
| even | 0.5000 | 7 | 42 | 133 | 455 |
| even | 1.0000 | NA | NA | NA | 2 |
| uneven | 0.0000 | 934 | 999 | 999 | 1000 |
| uneven | 0.0001 | 931 | 999 | 1000 | 1000 |
| uneven | 0.0005 | 882 | 1000 | 1000 | 1000 |
| uneven | 0.0010 | 810 | 1000 | 1000 | 1000 |
| uneven | 0.0050 | 479 | 986 | 1000 | 1000 |
| uneven | 0.0100 | 347 | 927 | 1000 | 1000 |
| uneven | 0.0500 | 103 | 483 | 935 | 1000 |
| uneven | 0.1000 | 42 | 280 | 722 | 995 |
| uneven | 0.5000 | 7 | 42 | 128 | 451 |
vals %>%
ggplot(aes(x = jack.p.rD, group = mutation_rate, fill = mutation_rate)) +
geom_histogram(alpha = 0.5, binwidth = 0.05) +
facet_grid(sexrate~sample, scales = "free") +
ggtitle("P-value from jack-knife analysis weighted by psex")
Okay, this result is nearly the complete opposite of the one above. Let’s again look at the cases where \(\bar{r}_d\) was significant for the full data set but had a p-value > 0.05 after clone-correction:
vals %>%
filter(p.rD < p.rDcc, p.rDcc > 0.05) %>%
group_by(sexrate, sample) %>%
summarize(reject = sprintf("%.2f (%d)", sum(jack.p.rD < 0.05)/n(), n())) %>%
spread(sample, reject)
## # A tibble: 10 x 5
## # Groups: sexrate [10]
## sexrate `10` `25` `50` `100`
## * <chr> <chr> <chr> <chr> <chr>
## 1 0.0000 0.00 (331) 0.00 (304) 0.00 (272) 0.00 (279)
## 2 0.0001 0.00 (281) 0.00 (218) 0.00 (217) 0.00 (215)
## 3 0.0005 0.00 (168) 0.00 (22) 0.00 (8) 0.00 (2)
## 4 0.0010 0.00 (337) 0.00 (71) 0.00 (26) 0.00 (10)
## 5 0.0050 0.00 (748) 0.00 (911) 0.00 (568) 0.01 (329)
## 6 0.0100 0.00 (617) 0.00 (1362) 0.00 (1159) 0.00 (900)
## 7 0.0500 0.00 (201) 0.00 (893) 0.00 (1668) 0.00 (1734)
## 8 0.1000 0.00 (74) 0.00 (510) 0.00 (1280) 0.00 (1713)
## 9 0.5000 0.00 (13) 0.00 (78) 0.00 (233) 0.00 (675)
## 10 1.0000 <NA> <NA> <NA> 0.00 (2)
vals %>%
filter(p.rD < p.rDcc, p.rDcc > 0.05) %>%
ggplot(aes(x = jack.p.rD, group = mutation_rate, fill = mutation_rate)) +
geom_histogram(alpha = 0.5, binwidth = 0.05) +
facet_grid(sexrate~sample, scales = "free_y")
# Upper-tail p-value: the fraction of null samples y at least as large as the
# observed value x, with a +1 pseudo-count so the p-value is never zero
makep <- function(x, y) (sum(x <= y, na.rm = TRUE) + 1)/(length(y) + 1)
bvals <- vals %>%
filter(p.rD < p.rDcc, p.rDcc > 0.05) %>%
mutate(jack.mean = map_dbl(jack.rd, mean, na.rm = TRUE)) %>%
mutate(bias = jack.mean - rbarD) %>% # calculating bias a la Efron
select(bias, jack.mean, everything()) %>%
mutate(jack.p.rDb = map2_dbl(rbarDcc - bias, samples.rdcc, makep)) %>%
select(bias, jack.p.rD, jack.p.rDb, everything())
Now we can visualize this as a table of the percentage rejection.
bvals %>%
group_by(sexrate, sample) %>%
summarize(reject = sprintf("%.2f (%.4f)",
sum(jack.p.rDb < 0.01)/n(),
mean(bias[is.finite(bias)], na.rm = TRUE))) %>%
spread(sample, reject) %>%
knitr::kable(digits = 3, caption = "Null Hypothesis Rejected (mean bias)")
| sexrate | 10 | 25 | 50 | 100 |
|---|---|---|---|---|
| 0.0000 | 0.51 (-0.2022) | 0.63 (-0.1126) | 0.68 (-0.0951) | 0.71 (-0.0831) |
| 0.0001 | 0.54 (-0.2280) | 0.59 (-0.1288) | 0.67 (-0.0933) | 0.70 (-0.0838) |
| 0.0005 | 0.91 (-0.1974) | 0.95 (-0.1903) | 1.00 (-0.1502) | 1.00 (-0.0677) |
| 0.0010 | 0.88 (-0.1274) | 0.99 (-0.1077) | 1.00 (-0.0974) | 1.00 (-0.0871) |
| 0.0050 | 0.38 (-0.0472) | 0.61 (-0.0250) | 0.83 (-0.0246) | 0.96 (-0.0241) |
| 0.0100 | 0.25 (-0.0417) | 0.24 (-0.0144) | 0.53 (-0.0132) | 0.78 (-0.0128) |
| 0.0500 | 0.09 (-0.0331) | 0.02 (-0.0062) | 0.01 (-0.0030) | 0.04 (-0.0027) |
| 0.1000 | 0.12 (-0.0358) | 0.01 (-0.0049) | 0.00 (-0.0019) | 0.00 (-0.0014) |
| 0.5000 | 0.31 (-0.0341) | 0.00 (-0.0047) | 0.00 (-0.0012) | 0.00 (-0.0004) |
| 1.0000 | NA | NA | NA | 0.00 (-0.0008) |
Now this is showing promise. To recap, this table is derived from the situations in which \(\bar{r}_d\) was significant for the full data set, but not for the clone-corrected data set. The values represented are the fraction of clone-corrected populations that rejected the null hypothesis after bias-correction.
A couple of notes about this table: I had to restrict the bias calculation to finite values because, for sex rates below 0.0005 and a sample size of 10, we saw some bias values less than -1 and a couple that were infinite. I’ll plot them here for your enjoyment.
infbias <- bvals %>% filter(bias < -1) %>%
select(pop, jack.rd, rbarDcc, NMLG)
infbias %>%
unnest() %>%
ggplot(aes(x = jack.rd)) +
geom_density(aes(fill = NMLG)) +
geom_rug() +
geom_vline(aes(xintercept = rbarDcc)) +
facet_wrap(~pop, ncol = 1)
## Warning: Removed 1999 rows containing non-finite values (stat_density).
What this shows is that when the number of MLGs is reduced to 3, the value of \(\bar{r}_d\) can be strongly negative!
Now back to the show.
bvals %>%
ggplot(aes(x = p.rDcc, group = mutation_rate, fill = mutation_rate)) +
geom_density(alpha = 0.75) +
geom_rug(alpha = 0.5) +
facet_grid(sexrate~sample, scales = "free_y") +
scale_fill_viridis(option = "A", discrete = TRUE) +
ggtitle("Clone-censored p-values without bias correction") +
xlab("P-value")
bvals %>%
ggplot(aes(x = jack.p.rDb, group = mutation_rate, fill = mutation_rate)) +
geom_density(alpha = 0.75) +
geom_rug(alpha = 0.5) +
facet_grid(sexrate~sample, scales = "free_y") +
scale_fill_viridis(option = "D", discrete = TRUE) +
ggtitle("P-values after bias correction of clone-censored data") +
xlab("P-value (bias-corrected)")
bvals %>%
unite(the_group, seed, rep, run, gen) %>%
ggplot(aes(x = sexrate, y = bias, fill = mutation_rate)) +
geom_line(aes(group = the_group), alpha = 0.2) +
geom_boxplot(aes(group = factor(sexrate))) +
# geom_point(aes(group = the_group), alpha = 0.2) +
facet_grid(sample ~ mutation_rate) +
ggtitle("Progression of bias", subtitle = "full data set compared to sub-sample")
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
After thinking about this for a little bit, I realized that this method is flawed because the bias is greater when the clone-corrected value is closer to zero. An example can be shown with the “nancycats” data set in adegenet, which documents 17 colonies of cats in Nancy, France. Since we know cats don’t clone, we can trust that this data set comes from sexually reproducing organisms.
data("nancycats")
nan7 <- nancycats[pop = 7]
set.seed(999)
nan72 <- nan7[sample(nInd(nan7), nInd(nan7)*2, replace = TRUE)]
indNames(nan72) <- paste("sample", seq(nInd(nan72)))
nan72 <- as.genclone(nan72)
nan7ia <- ia(nan7, sample = 999, valuereturn = TRUE, quiet = TRUE, plot = FALSE)
nan72ia <- ia(nan72, sample = 999, valuereturn = TRUE, quiet = TRUE, plot = FALSE)
nan72ccia <- ia(clonecorrect(nan72, NA), sample = 999, valuereturn = TRUE, quiet = TRUE, plot = FALSE)
## Warning in clonecorrect(nan72, NA): Strata is not set for nan72. Clone
## correct will be performed without population information.
nanjack <- jack.ia(nan72, use_psex = TRUE, method = 'multiple', mul = 2, rep = 999, quiet = TRUE)
idx <- c("Ia", "rbarD")
nan72bi <- nan72ccia
# Calculating the bias-correction
nan72bi$index[idx] <- nan72ccia$index[idx] - map2_dbl(nanjack, nan72ia$index[idx], function(a, b) (mean(a) - b))
nan72bi$index[c(FALSE, TRUE)] <- map2_dbl(nan72bi$index[idx], nan72bi$samples, makep)
plot(nan7ia) + ggtitle("Original Data")
plot(nan72ia) + ggtitle(paste("Resampled Data with", nmll(nan72), "unique samples"))
plot(nan72ccia) + ggtitle(paste("Resampled Data (clone-corrected)"))
plot(nan72bi) + ggtitle(paste("Resampled Data (bias-corrected)"))
We would have incorrectly declared this population to be clonal, but if we simply used the mean of the jackknifed samples (accounting for \(p_{sex}\)), 0.042, we would not have been able to reject the null hypothesis.
Here, we will use the mean of the jackknifed samples as the estimate and re-calculate the p-value from there. Note that we are using the null distribution generated from the full data set to calculate this new p-value.
jvals <- vals %>%
filter(p.rD < p.rDcc, p.rDcc > 0.05) %>%
mutate(jack.mean = map_dbl(jack.rd, mean, na.rm = TRUE)) %>%
mutate(bias = jack.mean - rbarDcc) %>%
mutate(jack.p = map2_dbl(jack.mean, samples.rd, makep)) %>% # against the full null
mutate(jack.pcc = map2_dbl(jack.mean, samples.rdcc, makep)) %>% # against the clone-corrected null
select(jack.pcc, everything())
What I show here are the results when compared against the full null distribution, which exhibits less variance than the clone-corrected null. When compared against the clone-corrected null, there were no cases where the null hypothesis was rejected at the 0.01 level.
There are three tables:
jvals %>%
group_by(sexrate, sample) %>%
summarize(reject = sum(jack.p <= 0.01, na.rm = TRUE)/n()) %>%
spread(sample, reject) %>%
knitr::kable(digits = 3, caption = "Null Hypothesis Rejected (p <= 0.01)")
| sexrate | 10 | 25 | 50 | 100 |
|---|---|---|---|---|
| 0.0000 | 0.066 | 0.112 | 0.103 | 0.172 |
| 0.0001 | 0.114 | 0.087 | 0.115 | 0.121 |
| 0.0005 | 0.137 | 0.227 | 0.375 | 0.500 |
| 0.0010 | 0.039 | 0.085 | 0.423 | 0.600 |
| 0.0050 | 0.000 | 0.000 | 0.005 | 0.061 |
| 0.0100 | 0.000 | 0.000 | 0.001 | 0.003 |
| 0.0500 | 0.000 | 0.000 | 0.000 | 0.000 |
| 0.1000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 0.5000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 1.0000 | NA | NA | NA | 0.000 |
jvals %>%
group_by(sexrate, sample) %>%
summarize(reject = sum(jack.p <= 0.01 | Evennesscc > 0.85, na.rm = TRUE)/n()) %>%
spread(sample, reject) %>%
knitr::kable(digits = 3, caption = "Null Hypothesis Rejected (p <= 0.01 or E5A > 0.85)")
| sexrate | 10 | 25 | 50 | 100 |
|---|---|---|---|---|
| 0.0000 | 0.973 | 0.980 | 0.982 | 0.957 |
| 0.0001 | 0.964 | 0.972 | 0.991 | 0.991 |
| 0.0005 | 0.280 | 0.273 | 0.375 | 0.500 |
| 0.0010 | 0.089 | 0.085 | 0.423 | 0.600 |
| 0.0050 | 0.033 | 0.003 | 0.012 | 0.061 |
| 0.0100 | 0.036 | 0.008 | 0.006 | 0.006 |
| 0.0500 | 0.025 | 0.001 | 0.002 | 0.001 |
| 0.1000 | 0.054 | 0.002 | 0.002 | 0.003 |
| 0.5000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 1.0000 | NA | NA | NA | 0.000 |
jvals %>%
group_by(sexrate, sample) %>%
summarize(reject = sum(jack.p <= 0.01 & Evennesscc > 0.85, na.rm = TRUE)/n()) %>%
spread(sample, reject) %>%
knitr::kable(digits = 3, caption = "Null Hypothesis Rejected (p <= 0.01 and E5A > 0.85)")
| sexrate | 10 | 25 | 50 | 100 |
|---|---|---|---|---|
| 0.0000 | 0.066 | 0.112 | 0.103 | 0.158 |
| 0.0001 | 0.110 | 0.083 | 0.111 | 0.112 |
| 0.0005 | 0.060 | 0.045 | 0.000 | 0.000 |
| 0.0010 | 0.012 | 0.000 | 0.000 | 0.000 |
| 0.0050 | 0.000 | 0.000 | 0.000 | 0.000 |
| 0.0100 | 0.000 | 0.000 | 0.000 | 0.000 |
| 0.0500 | 0.000 | 0.000 | 0.000 | 0.000 |
| 0.1000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 0.5000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 1.0000 | NA | NA | NA | 0.000 |
jvals %>%
ggplot(aes(x = jack.p, group = mutation_rate, fill = mutation_rate)) +
geom_histogram(alpha = 0.5, binwidth = 0.05) +
facet_grid(sexrate~sample, scales = "free") +
ggtitle("P-value from jack-knife analysis weighted by psex")
I thought that clonal samples might exhibit a greater bias, but this does not appear to be the case.
jvals %>%
unite(the_group, seed, rep, run, gen) %>%
ggplot(aes(x = sexrate, y = bias, fill = mutation_rate)) +
# geom_line(aes(group = the_group), alpha = 0.2) +
geom_boxplot(aes(group = factor(sexrate))) +
# geom_point(aes(group = the_group), alpha = 0.2) +
facet_grid(sample ~ mutation_rate) +
scale_y_log10() +
ggtitle("Progression of bias", subtitle = "full data set compared to sub-sample")
## Warning in self$trans$transform(x): NaNs produced
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 13202 rows containing non-finite values (stat_boxplot).
Clone-correction provides a means of assessing \(\bar{r}_d\) without the bias from duplicated genotypes (which tends to inflate the statistic) at the cost of bias from a decreased sample size. It is intuitive that we can account for the sample size bias by randomly sampling a subset of the data without replacement and recalculating \(\bar{r}_d\). The problem is that, while this removes the bias due to sample size, it’s still affected by the bias introduced from over-represented genotypes.
We can address both of these biases at the same time by weighting the samples by the probability of encountering the nth sample of the same genotype by chance, also known as \(p_{sex}\). This allows identical genotypes from different genets to be included in the analysis and gives us a better idea of what the population without identical-by-descent genotypes looks like.
An example of how this works can be seen in real data from the outbreak of Phytophthora ramorum in Curry County, OR.
library("poppr")
library("ggjoy")
data(Pram)
iard <- c("Ia", "rbarD") # vector to get stats from results
JHC <- Pram[pop = "JHallCr_OR"] # grabbing the Joe Hall Creek watershed
# calculating allele frequencies from entire epidemic
af <- rraf(Pram, by_pop = FALSE, mul = 2, res = "vector")
jia <- ia(JHC, sample = 999, quiet = TRUE, valuereturn = TRUE)
jiacc <- JHC %>% clonecorrect(NA) %>% ia(sample = 999, quiet = TRUE, valuereturn = TRUE)
jjak <- jack.ia(JHC, quiet = TRUE, reps = 999, use_psex = TRUE, method = "multiple", freq = af)
jjak2 <- jack.ia(JHC, quiet = TRUE, reps = 999)
(jbias <- map_dbl(jjak, mean, na.rm = TRUE))
## Ia rbarD
## 0.09525167 0.02444025
(jiap <- map2_dbl(jbias, jia$samples, function(a, b) (sum(a <= b, na.rm = TRUE) + 1)/(length(b) + 1)))
## Ia rbarD
## 0.002 0.002
df <- data_frame(rbarD = c(jia$index[3], jiacc$index[3], jbias[2]),
type = c("observed", "clone-censored", "estimate"))
bind_rows(null = jia$samples,
nullcc = jiacc$samples,
resampled.psex = jjak,
resampled = jjak2,
.id = "type") %>%
mutate(type = forcats::fct_rev(type)) %>%
group_by(type) %>%
mutate(rdmean = mean(rbarD, na.rm = TRUE)) %>%
ungroup() %>%
ggplot(aes(x = rbarD, y = type)) +
geom_joy(aes(fill = rdmean), alpha = 0.75, scale = 5) +
# geom_rug(aes(color = type), alpha = 0.1) +
geom_vline(xintercept = df$rbarD[-3], lty = 1:2, color = c("black", "grey30"), size = 1) +
annotate(geom = "text",
# size = 3.,
fontface = "bold",
color = c("black", "grey30"),
x = df$rbarD[-3],
y = c(6, 6) + 0.5,
label = df$type[-3],
hjust = c(0, 0) + -0.05,
angle = 35) +
viridis::scale_fill_viridis(direction = -1, limits = c(0, 0.2)) +
# scale_fill_brewer(palette = "Set1", labels = c("Null", "Null (clone-censored)", "Resampled", "Resampled (Psex)")) +
# scale_color_brewer(palette = "Set1", labels = c("Null", "Null (clone-censored)", "Resampled", "Resampled (Psex)")) +
theme_joy(font_size = 16, font_family = "Helvetica", grid = FALSE) +
theme(aspect.ratio = 0.68) +
theme(legend.position = "none") +
scale_y_discrete(breaks = c("null", "nullcc", "resampled.psex", "resampled"),
labels = c("Null", "Null (clone-censored)", "Resampled (Psex)", "Resampled"),
expand = c(0, 0)) +
scale_x_continuous(expand = c(0, 0), limits = c(-0.06, 0.3)) +
theme(axis.title.x = element_text(hjust = 0.5)) +
theme(axis.title.y = element_blank()) +
# theme(panel.border = element_blank()) +
theme(panel.grid.major.x = element_line(color = "grey20", linetype = 3)) +
# theme(panel.grid.minor.x = element_line(color = "grey50")) +
xlab(eval(quote(expression(bar(r)[d]))))
## Picking joint bandwidth of 0.0063
## Warning: Removed 33 rows containing non-finite values (stat_joy).
ggsave(here::here("manuscript", "figure", "pram.pdf"),
width = 7,
height = 4)
## Picking joint bandwidth of 0.0063
## Warning: Removed 33 rows containing non-finite values (stat_joy).
What we can see from this example is that the estimate here would easily be considered clonal when compared to the full null distribution. However, when compared to the clone-censored null, the estimate is still non-significant due to the wide variance exhibited. Another issue is that this approach can only be performed with a small number of markers, since \(p_{gen}\), and therefore \(p_{sex}\), approaches zero as the number of loci increases. However, we can see that this approach does detect the non-random mating in populations with a sex rate > 1%.
Ultimately, there is still no good way of assessing clonality. We must accept the fact that \(\bar{r}_d\) was simply not designed to be this sensitive to different levels of sexual reproduction. It may be more important to think about what the biological consequence of 1% sexual reproduction in a population is.